Kazutoshi KOBAYASHI Masanao YAMAOKA Yukifumi KOBAYASHI Hidetoshi ONODERA Keikichi TAMARU
We propose a functional memory for addition (FMA), which is a memory-merged logic LSI. It is a memory as well as a SIMD parallel processor. To minimize the area, a processing element (PE) consists of several DRAM words and a bit-serial ALU. The ALU performs addition bit by bit. This paper describes two FMA experimental LSIs. One is for general purpose, and the other is for full search block matching of image compression. We estimate that a 0.18 µm process realizes 57,000 PEs in a 50 mm² die, achieving 205 GOPS under 1.36 W power.
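As a rough illustration of what bit-by-bit addition in a bit-serial ALU involves, the sketch below adds two unsigned words one bit per step, starting from the least significant bit. The function name `bit_serial_add` and the `width` parameter are hypothetical; this is a software model of the general technique, not the FMA circuit described in the paper.

```python
def bit_serial_add(a, b, width):
    """Add two unsigned integers one bit per step, least significant bit first.

    Software model of a generic bit-serial adder (an assumption for
    illustration, not the FMA hardware). Returns (sum modulo 2**width, carry_out).
    """
    carry = 0
    result = 0
    for i in range(width):
        x = (a >> i) & 1                      # current bit of operand a
        y = (b >> i) & 1                      # current bit of operand b
        s = x ^ y ^ carry                     # full-adder sum bit
        carry = (x & y) | (carry & (x ^ y))   # full-adder carry-out
        result |= s << i
    return result, carry


example_sum = bit_serial_add(5, 3, 4)        # 5 + 3 in 4 bits
example_overflow = bit_serial_add(15, 1, 4)  # 15 + 1 overflows 4 bits
```

A word of `width` bits takes `width` steps, which is the usual trade of a bit-serial design: a very small ALU per PE in exchange for more cycles per operation.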
Jar-Ferr YANG Shu-Sheng HAO Wei-Yuan LU
In this paper, we propose fast block matching criteria to reduce the implementation complexity of motion estimation in VLSI video coders. Based on generalized quantization of pixel difference measures, the block matching criteria combined with a bitmap exclusive-OR (XOR) concept can be realized by short-length adders and a multi-input binary counter. The proposed approach can be treated as a generalization of the pixel difference classification (PDC) criterion. Simulation results show that the proposed block-matching criteria along with various block search algorithms achieve better results than the PDC and obtain nearly the same performance as the mean absolute difference (MAD) criterion. Moreover, the complete gate-level synthesis of the proposed matching criterion is much smaller than those of the MAD and the PDC in VLSI implementation.
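For readers unfamiliar with the two baseline criteria named in this abstract, the following minimal sketch computes MAD and PDC for a pair of equally sized blocks. The function names and the threshold value are assumptions chosen for illustration; the proposed quantized XOR criterion itself is not reproduced here.

```python
def mad(block_a, block_b):
    """Mean absolute difference between two equally sized blocks (lower is better)."""
    total = 0
    count = 0
    for row_a, row_b in zip(block_a, block_b):
        for p, q in zip(row_a, row_b):
            total += abs(p - q)
            count += 1
    return total / count


def pdc(block_a, block_b, threshold):
    """Pixel difference classification: count pixels whose absolute
    difference is within the threshold (higher is better)."""
    return sum(
        1
        for row_a, row_b in zip(block_a, block_b)
        for p, q in zip(row_a, row_b)
        if abs(p - q) <= threshold
    )


# Hypothetical 2 x 2 blocks; the absolute differences are 2, 0, 5 and 9.
block_a = [[10, 20], [30, 40]]
block_b = [[12, 20], [25, 49]]
mad_value = mad(block_a, block_b)
pdc_value = pdc(block_a, block_b, threshold=4)
```

MAD needs an accumulator wide enough for the sum of all differences, whereas PDC only needs a comparator per pixel and a counter, which is why counting-based criteria are attractive for hardware.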
This paper gives a detailed presentation of a "vision chip" for very fast detection of motion vectors. The chip's design consists of a parallel pixel array and column-parallel block-matching processors. Each pixel of the pixel array contains a photo detector, an edge detector and 4 bits of memory. In the detection of motion vectors, the gray level image is first binarized by the edge detector, and the binary edge data is subsequently used in the block-matching processor. The block matching takes place locally in each pixel and globally in each column. The chip can create a dense field of motion where a vector is assigned to each pixel by overlapping 2 × 2 target blocks. A prototype with 16 × 16 pixels and four block-matching processors has been designed and implemented. Preliminary results obtained by the prototype are shown.
In this paper, we present two fast motion estimation techniques with adaptive variable search range, using the spatial and temporal correlation of moving pictures respectively. The first technique uses the frame difference between two adjacent frames as the criterion for deciding the search window size. The second uses the deviation between the past and the predicted current-frame motion vectors as the criterion for deciding the search window size. Simulation results show that these methods reduce the number of checking points while keeping almost the same image quality as that of the full search method.
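The first technique can be pictured as a mapping from a frame-difference measure to a search window size. The sketch below is only one plausible reading: the mean absolute difference as the measure, the thresholds `(2.0, 8.0)`, and the ranges `(2, 4, 8)` are all assumptions for illustration and do not come from the paper.

```python
def mean_abs_frame_difference(cur_block, prev_block):
    """Mean absolute difference between co-located blocks of two adjacent frames."""
    total = 0
    count = 0
    for row_c, row_p in zip(cur_block, prev_block):
        for c, p in zip(row_c, row_p):
            total += abs(c - p)
            count += 1
    return total / count


def choose_search_range(frame_diff, thresholds=(2.0, 8.0), ranges=(2, 4, 8)):
    """Pick a search range from the frame difference.

    thresholds and ranges are hypothetical values: a small difference
    suggests little motion, so a small window is searched.
    """
    for limit, search_range in zip(thresholds, ranges):
        if frame_diff < limit:
            return search_range
    return ranges[-1]


# Hypothetical co-located 2 x 2 blocks; the absolute differences are 0, 2, 0 and 4.
diff = mean_abs_frame_difference([[1, 2], [3, 4]], [[1, 4], [3, 8]])
selected_range = choose_search_range(diff)
```

A block with search range r has (2r + 1)² candidate positions under full search, so shrinking r from 8 to 2 cuts the checking points from 289 to 25.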
Yankang WANG Yanqun WANG Hideo KURODA
This paper presents a novel approach to pixel decimation for motion estimation in video coding. Early techniques of pixel decimation use regular pixel patterns to evaluate the matching criterion. Recent techniques use adaptive pixel patterns and have achieved better efficiency. However, these adaptive techniques require an initial division of a block into a set of uniform regions and therefore are only locally adaptive in essence. In this paper, we present a globally adaptive scheme for pixel decimation, in which no regions are fixed at the beginning and pixels are selected only if they have features important to the determination of a match. Experimental results show that when no more than 40 pixels are selected out of a 16 × 16 block, this approach improves search accuracy by 13-22% over the previous locally adaptive methods, which also use features.
Yankang WANG Yanqun WANG Hideo KURODA
Conventional fast block-matching algorithms, such as TSS and DSWA/IS, are widely used for motion estimation in low-bit-rate video coding. These algorithms are based on the assumption that when searching in the previous frame for the block that best matches a block in the current frame, the difference between them increases monotonically as the matching block moves away from the optimal solution. Unfortunately, this assumption of global monotonicity is often not valid, which makes it likely that the matching block is trapped in a local minimum. On the other hand, monotonicity does exist in localized areas. In this paper, we propose a new algorithm called the Peano-Hilbert scanning search algorithm (PHSSA). With the Peano-Hilbert image representation, the assumption of global monotonicity is not necessary, while local monotonicity can be effectively exploited with binary search. PHSSA selects multiple winners at each search stage, minimizing the possibility of the result being trapped in local minima. The algorithm allows selection of three parameters to meet different search accuracy and processing speed requirements: (1) the number of initial candidate intervals, (2) a threshold to remove the unpromising candidate intervals at each stage, and (3) a threshold to control when interval subdivision stops. With proper parameters, the multiple-candidate PHSSA converges to the optimal result faster and with better accuracy than the conventional block matching algorithms.
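The Peano-Hilbert representation orders the positions of a 2-D grid along a single curve so that neighbors on the curve are also neighbors in the image. The sketch below shows the standard distance-to-coordinate mapping for a Hilbert curve on a power-of-two grid. The function name `hilbert_d2xy` is hypothetical, and the PHSSA interval search itself is not reproduced.

```python
def hilbert_d2xy(size, d):
    """Map distance d along a Hilbert curve to (x, y) on a size x size grid.

    size must be a power of two and 0 <= d < size * size.
    """
    x = y = 0
    t = d
    s = 1
    while s < size:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        # Rotate or flip the sub-square so that consecutive quadrants connect.
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


# Positions of a 4 x 4 grid in Hilbert order.
curve = [hilbert_d2xy(4, d) for d in range(16)]
```

Because consecutive curve positions are always adjacent grid positions, an interval of the 1-D ordering corresponds to a compact 2-D region, which is the property that makes interval subdivision along the curve meaningful.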
Seunghwan LEE Masanori HARIYAMA Michitaka KAMEYAMA
Three-dimensional (3-D) instrumentation using an image sequence is a promising instrumentation method for intelligent systems in which accurate 3-D information is required. However, real-time instrumentation is difficult since much computation time and a large memory bandwidth are required. In this paper, a 3-D instrumentation VLSI processor with a concurrent memory-access scheme is proposed. To reduce the access time, frequently used data are stored in a cache register array and are concurrently transferred to processing elements using simple interconnections to the 8-nearest neighbor registers. Based on a row and column memory access pattern, we propose a diagonally interleaved frame memory by which pixel values of a row and column are stored across memory modules. Based on the concurrent memory-access scheme, a 40 GOPS processor is designed and the delay time for the instrumentation is estimated to be 42 ms for a 256 × 256 image.
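One common way to obtain conflict-free row and column access is skewed storage, where the module index is (row + column) modulo the number of modules. The sketch below assumes that this is what "diagonally interleaved" means here. The mapping, the function names, and the local address formula are illustrative assumptions, not the paper's design.

```python
def module_of(row, col, num_modules):
    """Memory module holding pixel (row, col) under skewed (diagonal) interleaving."""
    return (row + col) % num_modules


def address_of(row, col, width, num_modules):
    """Hypothetical local address inside the module; width must be a
    multiple of num_modules for the mapping to be one-to-one."""
    return row * (width // num_modules) + col // num_modules


NUM_MODULES = 4
# Four consecutive pixels of row 2 and of column 1 each hit four different modules.
row_modules = [module_of(2, c, NUM_MODULES) for c in range(4)]
col_modules = [module_of(r, 1, NUM_MODULES) for r in range(4)]
```

With plain row-wise interleaving (module = column modulo N), a column access would hit the same module N times in a row. The diagonal skew spreads both access patterns across all modules, so N pixels can be read concurrently in either direction.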
Han-Kyu LEE Jae-Yeal NAM Jin-Soo CHOI Yeong-Ho HA
Full-search block-matching motion estimation is a popular method to reduce temporal redundancies in video sequences. Due to its excessive computational load, parallel processing architectures are often required for real-time processing. One such architecture is Hsieh's architecture, based on a systolic array processor and shift register arrays. The serial-input characteristic of his scheme can reduce the number of pixel inputs to one, at the expense of significantly increasing the initialization time. This paper presents a modified and generalized version of Hsieh's architecture to reduce the initialization time. The proposed architecture can easily control data flows by rearranging shift register arrays, and input-pin counts by using multiplexers at the input stage, while retaining the good properties of Hsieh's. The proposed architecture has the following advantages: (1) it allows controllable data inputs to save the pin counts, (2) it is flexible to the dimensional change of the search area via simple control, (3) it can operate in real time for video conference applications, and (4) it has a simple and modular structure which is quite suitable for VLSI implementation. For verification of the proposed architecture, VHDL simulations are performed and some results are given.
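To make the computational load of full search concrete, the following sketch evaluates every displacement in a square search window and keeps the one with the smallest sum of absolute differences (SAD). It is a plain software reference under assumed names and toy data, not a model of Hsieh's systolic architecture or of the proposed one.

```python
def sad(cur, ref, top, left, ref_top, ref_left, size):
    """Sum of absolute differences between a current block and a reference block."""
    return sum(
        abs(cur[top + i][left + j] - ref[ref_top + i][ref_left + j])
        for i in range(size)
        for j in range(size)
    )


def full_search(cur, ref, top, left, size, search_range):
    """Return ((dy, dx), cost) of the best match within +/- search_range pixels."""
    best_cost = None
    best_vec = (0, 0)
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            r, c = top + dy, left + dx
            # Skip candidates that fall outside the reference frame.
            if r < 0 or c < 0 or r + size > len(ref) or c + size > len(ref[0]):
                continue
            cost = sad(cur, ref, top, left, r, c, size)
            if best_cost is None or cost < best_cost:
                best_cost, best_vec = cost, (dy, dx)
    return best_vec, best_cost


# Toy data: an 8 x 8 reference frame and a 6 x 6 current frame whose content
# is the reference shifted by one row and two columns.
ref_frame = [[10 * y + x for x in range(8)] for y in range(8)]
cur_frame = [[ref_frame[y + 1][x + 2] for x in range(6)] for y in range(6)]
motion = full_search(cur_frame, ref_frame, top=2, left=2, size=2, search_range=2)
```

Every block costs (2r + 1)² SAD evaluations of size² pixels each, and neighboring candidates reuse most of the same reference pixels, which is the regularity that systolic arrays and shift register arrays exploit.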
A new Elastic-Block Matching Algorithm using bilinear space warping is proposed. In this scheme, a convex quadrilateral that minimizes a distortion measure against the current square block is searched for, to compensate for the shape deformation caused by a rigid body's three-dimensional depth motion or rotation. The proposed algorithm gives a remarkable improvement in motion-compensated prediction compared with the conventional algorithm.
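Bilinear warping maps each position of a square block to a position inside a quadrilateral by blending the four corner points. The sketch below shows that standard mapping for normalized block coordinates (u, v) in [0, 1]. The function name, the corner ordering, and the sample corners are assumptions for illustration, and the search for the best quadrilateral is not shown.

```python
def bilinear_warp(u, v, corners):
    """Map normalized block coordinates (u, v) into a quadrilateral.

    corners is ((x00, y00), (x10, y10), (x01, y01), (x11, y11)): the images of
    the block corners (u, v) = (0, 0), (1, 0), (0, 1) and (1, 1).
    """
    (x00, y00), (x10, y10), (x01, y01), (x11, y11) = corners
    w00 = (1 - u) * (1 - v)
    w10 = u * (1 - v)
    w01 = (1 - u) * v
    w11 = u * v
    x = w00 * x00 + w10 * x10 + w01 * x01 + w11 * x11
    y = w00 * y00 + w10 * y10 + w01 * y01 + w11 * y11
    return x, y


# Hypothetical quadrilateral: a 4 x 4 square with one corner pulled out to (6, 6).
quad = ((0, 0), (4, 0), (0, 4), (6, 6))
center = bilinear_warp(0.5, 0.5, quad)
```

When the four corners form an axis-aligned square, the mapping reduces to a pure translation and scaling, so conventional block matching is the special case in which all four corner displacements are equal.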